BLEU deconstructed: Designing a Better MT Evaluation Metric
Authors
Abstract
BLEU is the de facto standard automatic evaluation metric in machine translation. While BLEU is undeniably useful, it has a number of limitations. Although it works well for large documents and multiple references, it is unreliable at the sentence or sub-sentence levels, and with a single reference. In this paper, we propose new variants of BLEU which address these limitations, resulting in a more flexible metric which is not only more reliable, but also allows for more accurate discriminative training. Our best metric has better correlation with human judgements than standard BLEU, despite using a simpler formulation. Moreover, these improvements carry over to a system tuned for our new metric.
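The abstract does not spell out the proposed variants, but the sentence-level brittleness it describes is easy to reproduce. The sketch below is a minimal, self-contained illustration (our own hypothetical code, not the paper's) of plain single-reference sentence-level BLEU: if any n-gram order has zero matches, the geometric mean of the modified precisions collapses to zero, no matter how good the lower-order overlap is.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    # Modified n-gram precisions, clipped against the single reference.
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    # Any zero precision (common for 4-grams in short sentences) zeroes the score.
    if min(precisions) == 0.0:
        return 0.0
    brevity = min(1.0, math.exp(1.0 - len(reference) / len(candidate)))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat sat on the mat".split()
reference = "there is a cat on the mat".split()
print(sentence_bleu(candidate, reference))  # 0.0: no shared 4-gram, so the real overlap is ignored
```

Smoothing the zero counts or reweighting the n-gram orders are the usual ways such sentence-level reformulations avoid this collapse.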
Similar resources
PORT: a Precision-Order-Recall MT Evaluation Metric for Tuning
Many machine translation (MT) evaluation metrics have been shown to correlate better with human judgment than BLEU. In principle, tuning on these metrics should yield better systems than tuning on BLEU. However, due to issues such as speed, requirements for linguistic resources, and optimization difficulty, they have not been widely adopted for tuning. This paper presents PORT, a new MT eval...
Extending MT evaluation tools with translation complexity metrics
In this paper we report on the results of an experiment in designing resource-light metrics that predict the potential translation complexity of a text or a corpus of homogeneous texts for state-of-the-art MT systems. We show that the best prediction of translation complexity is given by the average number of syllables per word (ASW). The translation complexity metrics based on this parameter are...
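As a rough illustration of the ASW feature named above, the following sketch (our own naive heuristic, not the paper's procedure) approximates syllable counts by vowel groups and averages them over the words of a text.

```python
import re

def avg_syllables_per_word(text):
    """Approximate ASW: average syllables per word, counting vowel groups as syllables."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)
    return syllables / len(words)

print(avg_syllables_per_word("The jurisprudential ramifications were extensively litigated."))
print(avg_syllables_per_word("The cat sat on the mat."))
```

Under this heuristic the first sentence has a far higher ASW than the second, i.e. it would be predicted as more complex to translate.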
An Awkward Disparity between BLEU / RIBES Scores and Human Judgements in Machine Translation
Automatic evaluation of machine translation (MT) quality is essential in developing high quality MT systems. Despite previous criticisms, BLEU remains the most popular machine translation metric. Previous studies on the schism between BLEU and manual evaluation have highlighted cases where MT systems receive low BLEU scores but high manual evaluation scores. Alternatively, the RIBES me...
Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?
Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. Yet, their behaviors are not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. With a better understanding of these problems, we can better interpret the reported BLEU/NIST scores. In addition, this paper reports a...
Feasibility of Minimum Error Rate Training with a Human-Based Automatic Evaluation Metric
Minimum error rate training (MERT) involves choosing parameter values for a machine translation (MT) system that maximize performance on a tuning set as measured by an automatic evaluation metric, such as BLEU. The method is best when the system will eventually be evaluated using the same metric, but in reality, most MT evaluations have a human-based component. Although performing MERT with a h...
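To make the MERT setup described above concrete, here is a toy sketch of the objective (hypothetical names, with a crude random search standing in for MERT's actual line search): choose the weights whose 1-best hypotheses, selected from each tuning sentence's n-best list, maximise the chosen evaluation metric.

```python
import random

def one_best(weights, nbest):
    """Pick the hypothesis whose weighted feature score is highest."""
    return max(nbest, key=lambda hf: sum(w * f for w, f in zip(weights, hf[1])))[0]

def tune(nbest_lists, metric, dim, trials=500, seed=0):
    """Crude stand-in for MERT: random restarts instead of exact line search.

    nbest_lists: one list per tuning sentence of (hypothesis, feature_vector) pairs.
    metric: maps the list of selected hypotheses to a corpus-level score (e.g. BLEU).
    """
    rng = random.Random(seed)
    best_w, best_score = None, float("-inf")
    for _ in range(trials):
        w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        score = metric([one_best(w, nbest) for nbest in nbest_lists])
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score
```

The blurb's feasibility question concerns plugging a human-based metric into the `metric` slot above in place of an automatic score such as BLEU.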
Journal title:
Volume, Issue:
Pages: -
Publication date: 2013